Check the Rmd file for code to do polite introductions to these web sites.
Motivated by Ryo’s blog post, we are going to scrape soccer records from Wikipedia.
First, we’ll extract records on Asian Cup men’s soccer. This can be done with the code below.
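The code isn’t reproduced in full here, but a minimal sketch of the approach looks like the following. The Wikipedia URL and the table index (6) are assumptions based on the discussion below; verify them against the live page, which can change over time.

```r
library(rvest)

# Read the Wikipedia page and pull out all of its <table> elements
url <- "https://en.wikipedia.org/wiki/AFC_Asian_Cup"
page <- read_html(url)
tbl <- html_nodes(page, "table")

# The records are assumed to be in the 6th table on the page
asian_cup <- html_table(tbl[[6]], fill = TRUE)
```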
How can we know that the data is organised as a table in the html? Go to the web site in your Google Chrome browser and use SelectorGadget to see the elements in the html.
How did we know to extract the data from the 6th table? Try changing the 6 in the code html_table(tbl[[6]], fill=TRUE) and compare the result with the tables presented in the web page.
Let’s use the data to make a chart. Examine the goals for and against of each team. Add a guide line where for and against are equal. Also make the plot interactive, using plotly, so you can hover over team names. Which teams have scored more goals than have been scored against them, over all time? How does this compare with the champions list?
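One way to sketch this, assuming the scraped table is in asian_cup with columns named Team, GF (goals for) and GA (goals against) — the actual column names on the Wikipedia table may differ:

```r
library(ggplot2)
library(plotly)

# Scatter goals for against goals conceded, with a guide line where they are equal
# (column names GF, GA, Team are assumptions; rename to match your table)
p <- ggplot(asian_cup, aes(x = GA, y = GF, label = Team)) +
  geom_abline(intercept = 0, slope = 1, colour = "grey70") +
  geom_point() +
  labs(x = "Goals against", y = "Goals for")

# Mouse over a point to see the team name
ggplotly(p)
```

Teams above the guide line have scored more goals than have been scored against them.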
Inspect: this will bring up the page source. Can you see the <table> tag, and the same information that is visible in the web page itself?
Using the fetch_cricinfo function from the cricketdata package, extract all the records for Australian women’s T20 matches, and answer the following questions:
# remotes::install_github("ropenscilabs/cricketdata")
library(cricketdata)
auswt20 <- fetch_cricinfo("T20", "Women", country="Aust")
## # A tibble: 1 x 1
## n
## <int>
## 1 53
## # A tibble: 1 x 2
## Start End
## <int> <int>
## 1 2005 2020
## # A tibble: 53 x 2
## Player Matches
## <chr> <int>
## 1 EA Perry 120
## 2 AJ Healy 112
## 3 MM Lanning 104
## 4 AJ Blackwell 95
## 5 JL Jonassen 79
## 6 RL Haynes 67
## 7 M Schutt 67
## 8 JE Duffin 64
## 9 EJ Villani 62
## 10 EA Osborne 59
## # … with 43 more rows
The work in fetch_cricinfo is mostly done by a hidden function, cricketdata:::fetch_cricket_data. A key part of this function occurs in these lines:

url <- paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=",
              matchclass,
              ifelse(is.null(country), "", paste0(";team=", team)),
              ";page=", format(page, scientific = FALSE),
              ";template=results;type=", activity,
              view_text, ";size=200;wrappertype=print")
Try setting these values, and creating the URL manually. When you have it right, you will have found the page that the data is extracted from! (Alternatively, if this is too frustrating try working with the “statguru” query tool to find the table of interest.)
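A sketch of filling in the pieces by hand, to match the fetch_cricinfo("T20", "Women", country="Aust") call above. The class and team codes here are assumptions — check them against the package source or against the URLs ESPNcricinfo generates:

```r
matchclass <- 3        # assumed code for T20 matches
team <- 289            # assumed ESPNcricinfo id for Australia Women
page <- 1
activity <- "batting"
view_text <- ""

url <- paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=",
              matchclass, ";team=", team,
              ";page=", format(page, scientific = FALSE),
              ";template=results;type=", activity,
              view_text, ";size=200;wrappertype=print")
```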
Using your URL, use the read_html and html_table functions to reproduce what the cricketdata package did for you.
Note that the package returned 53 records because there were two pages of data. The manual code only scraped one of the two pages.
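A sketch that loops over both pages and stacks the results. The class and team codes, and the "engineTable" selector and table index, are assumptions — inspect the page to confirm which table holds the records:

```r
library(rvest)
library(dplyr)
library(purrr)

# Build the URL for a given results page (class/team codes are assumed)
stats_url <- function(page) {
  paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3",
         ";team=289;page=", page,
         ";template=results;type=batting;size=200;wrappertype=print")
}

# Scrape one page of results; the stats tables appear to use class
# "engineTable", and there may be several, so check which index you need
scrape_page <- function(page) {
  read_html(stats_url(page)) %>%
    html_nodes("table.engineTable") %>%
    .[[1]] %>%
    html_table(fill = TRUE)
}

# Two pages of data were noted above, so scrape both and bind the rows
records <- map_dfr(1:2, scrape_page)
```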
🤔 Use the inspect tool on the web page to see that it is indeed an html table.
Sometimes pages are dynamically created, which means that it isn’t possible to extract the data directly. The women’s tennis records web site uses dynamic web pages.
url <- "https://www.wtatennis.com/stats"
wta_html <- read_html(url)
# No table is found, because the page content is generated dynamically
wta_rankings <- html_node(wta_html, "table")
# Read a locally saved copy of the rendered page, because the live page builds the table with javascript
wta_html <- read_html("wta_rankings2.htm")
wta_rankings <- html_node(wta_html, "table") %>% html_table(fill=TRUE)
# There is only one table in page so use html_node rather than html_nodes
wta_rankings <- wta_rankings %>%
janitor::remove_empty() %>%
as_tibble()
This time you’ve got data. How can you tell that the page is dynamic? Inspect the source, and you will find <script> tags. This often indicates that javascript is used to create the table.
Use your wrangling skills to clean up the data, making numeric variables numeric.
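A minimal sketch, using readr’s type guessing. Note that type_convert only re-guesses character columns, and columns containing symbols such as "%" may need parse_number instead:

```r
library(dplyr)
library(readr)

# Re-guess column types: character columns holding numbers become numeric
wta_clean <- wta_rankings %>% type_convert()
```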
Make a plot with the data you’ve just extracted. Your choice.